Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


website at “https://dataguide.nlm.nih.gov/edirect/install.html”. The following script uses EDirect to create a metadata file. It searches the NCBI SRA database for the BioProject “PRJEB24421”, retrieves the sample metadata, and stores it in the TSV file “sample-metadata.tsv”.

esearch -db sra -query 'PRJEB24421[bioproject]' \
| efetch -format runinfo \
| tr ',' '\t' > sample-metadata.tsv

Then, you can edit the file as above.
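Before editing, it can help to see which columns the downloaded table contains and how many rows it holds. The following is a minimal sketch, assuming “sample-metadata.tsv” is tab-separated with a single header row; the helper names list_columns and count_rows are hypothetical and are not part of EDirect.

```shell
# Sketch: inspect a tab-separated metadata file such as sample-metadata.tsv.
# list_columns and count_rows are hypothetical helper names (not EDirect commands).

# Print each column name from the header row together with its column number.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | nl
}

# Count the data rows, i.e. all non-empty lines after the header.
count_rows() {
    awk 'NR > 1 && NF > 0' "$1" | wc -l | tr -d ' '
}

# Only run the inspection if the file is present in the current directory.
if [ -f sample-metadata.tsv ]; then
    list_columns sample-metadata.tsv
    echo "rows: $(count_rows sample-metadata.tsv)"
fi
```

The column numbers printed by list_columns can then be passed to a tool such as “cut -f” to keep only the columns needed in the metadata file.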

7.3.3.3 Importing Microbiome Yoga Data

Our example raw data are demultiplexed reads that are not in the Casava 1.8 format. To import the FASTQ files into a QIIME2 artifact, we need a manifest file listing the file names and their absolute paths, as described above. We can create the manifest file manually or with the following bash script. Before running the script, change to the “data” directory, where the FASTQ files are found, using “cd data”.

#Creating a manifest file
###############################
#a- make file name and absolute path
find "$PWD"/*.fastq -type f -printf '%f %h/%f\n' > tmp.txt
#b- replace _1.fastq or _2.fastq in the file name with a comma
awk '{ gsub(/_[12]\.fastq/, ",", $1); print }' tmp.txt > tmp2.txt
#remove space
sed -r 's/\s+//g' tmp2.txt > tmp3.txt
#count the forward files (one per sample)
n=$(ls *_1.fastq | wc -l)
#create a direction column
seq "$n" | sed "c forward\nreverse" > tmp4.txt
#add direction column
paste tmp3.txt tmp4.txt | column -s $'\t' -t > tmp5.txt
#replace space with comma
sed -e 's/\s\+/,/g' tmp5.txt > manifest.txt
#add column names
sed -i '1s/^/sample-id,absolute-filepath,direction\n/' manifest.txt
rm tmp*.txt

The “manifest.txt” file will be created in the “data” directory and will look as shown in Figure 7.6.
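The same three-column layout can also be produced with a plain loop, which may be easier to follow and adapt. This is only a sketch under the assumption that every sample has files named “<sample>_1.fastq” and “<sample>_2.fastq” in one directory; make_manifest is a hypothetical helper name, not a QIIME2 command.

```shell
# Sketch: write a paired-end manifest for every <sample>_1.fastq / <sample>_2.fastq
# pair found in a directory. make_manifest is a hypothetical helper name.
make_manifest() {
    dir=$(cd "$1" && pwd) || return 1        # absolute path of the FASTQ directory
    echo "sample-id,absolute-filepath,direction"
    for fwd in "$dir"/*_1.fastq; do
        [ -e "$fwd" ] || continue            # no match: skip the unexpanded pattern
        sample=$(basename "$fwd" _1.fastq)   # strip the _1.fastq suffix
        rev="$dir/${sample}_2.fastq"
        if [ ! -f "$rev" ]; then
            echo "warning: no reverse file for $sample" >&2
            continue                         # leave unpaired samples out
        fi
        echo "$sample,$fwd,forward"
        echo "$sample,$rev,reverse"
    done
}

# Usage, from inside the data directory:
#   make_manifest . > manifest.txt
```

Unlike the pipeline above, this version skips a sample whose reverse file is missing instead of silently mislabeling the directions of the files that follow it.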

After running the above script, you can display the file content using the text editor of

your choice. Then, move back to the project main directory using “cd ..”.
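Besides reading the file by eye, a quick consistency check can catch problems before the import step. The sketch below assumes the comma-separated layout with the header “sample-id,absolute-filepath,direction” and file paths that contain no commas; check_manifest is a hypothetical helper, not part of QIIME2.

```shell
# Sketch: sanity-check a comma-separated manifest file.
# check_manifest is a hypothetical helper; it returns non-zero if a listed file
# is missing or a sample lacks exactly one forward and one reverse entry.
check_manifest() {
    # Collect listed paths that do not exist on disk (header row skipped).
    missing=$(tail -n +2 "$1" | while IFS=, read -r sample path direction; do
        [ -f "$path" ] || echo "$path"
    done)

    # Collect sample IDs without exactly one forward and one reverse row.
    unpaired=$(awk -F, 'NR > 1 { n[$1 "," $3]++; ids[$1] }
        END {
            for (s in ids)
                if (n[s ",forward"] != 1 || n[s ",reverse"] != 1) print s
        }' "$1")

    if [ -n "$missing" ]; then
        echo "$missing" | sed 's/^/missing file: /' >&2
    fi
    if [ -n "$unpaired" ]; then
        echo "$unpaired" | sed 's/^/incomplete pair: /' >&2
    fi
    [ -z "$missing" ] && [ -z "$unpaired" ]
}

# Usage, from the main project directory:
#   check_manifest data/manifest.txt && echo "manifest looks consistent"
```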

The next step is to import the FASTQ files into a QIIME2 artifact. To keep files organized, you can create a new subdirectory “inputs” for the artifact files.

mkdir inputs